{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " <table><tr><td><img src=\"images/dbmi_logo.png\" width=\"75\" height=\"73\" alt=\"Pitt Biomedical Informatics logo\"></td><td><img src=\"images/pitt_logo.png\" width=\"75\" height=\"75\" alt=\"University of Pittsburgh logo\"></td></tr></table>\n",
    " \n",
    " \n",
    " # Social Media and Data Science - Part 5\n",
    " \n",
    " \n",
    "Data science modules developed by the University of Pittsburgh Biomedical Informatics Training Program with the support of the National Library of Medicine data science supplement to the University of Pittsburgh (Grant # T15LM007059-30S1). \n",
    "\n",
    "Developed by Harry Hochheiser, harryh@pitt.edu. All errors are my responsibility.\n",
    "\n",
    "<a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc/4.0/\"><img alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc/4.0/88x31.png\" /></a><br />This work is licensed under a <a rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc/4.0/\">Creative Commons Attribution-NonCommercial 4.0 International License</a>.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Goal: Use social media posts to explore the appplication of text and natural language processing to see what might be learned from online interactions.\n",
    "\n",
    "Specifically, we will retrieve, annotate, process, and interpret Twitter data on health-related issues such as smoking."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--- \n",
    "References:\n",
    "* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)\n",
    "* The [Tweepy Python API for Twitter](http://www.tweepy.org/)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import operator\n",
    "import numpy as np\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import jsonpickle\n",
    "import json\n",
    "import random\n",
    "import tweepy\n",
    "import spacy\n",
    "import time\n",
    "from datetime import datetime\n",
    "from spacy.symbols import ORTH, LEMMA, POS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.0 Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This final part of our journey through social media data retrieval, annotation, natural langauge processing, and classififcation will challenge you to apply these techniques to a new problem. Specifically, you will create, annotate, and process a new data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.0.1 Setup\n",
    "\n",
    "As before, we start with the Tweets class and the configuration for our Twitter API connection.  We may not need this, but we'll load it in any case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Tweets:\n",
    "    \n",
    "    \n",
    "    def __init__(self,term=\"\",corpus_size=100):\n",
    "        self.tweets={}\n",
    "        if term !=\"\":\n",
    "            self.searchTwitter(term,corpus_size)\n",
    "                \n",
    "    def searchTwitter(self,term,corpus_size):\n",
    "        searchTime=datetime.now()\n",
    "        while (self.countTweets() < corpus_size):\n",
    "            new_tweets = api.search(term,lang=\"en\",tweet_mode='extended',count=corpus_size)\n",
    "            for nt_json in new_tweets:\n",
    "                nt = nt_json._json\n",
    "                if self.getTweet(nt['id_str']) is None and self.countTweets() < corpus_size:\n",
    "                    self.addTweet(nt,searchTime,term)\n",
    "            time.sleep(30)\n",
    "                \n",
    "    def addTweet(self,tweet,searchTime,term=\"\",count=0):\n",
    "        id = tweet['id_str']\n",
    "        if id not in self.tweets.keys():\n",
    "            self.tweets[id]={}\n",
    "            self.tweets[id]['tweet']=tweet\n",
    "            self.tweets[id]['count']=0\n",
    "            self.tweets[id]['searchTime']=searchTime\n",
    "            self.tweets[id]['searchTerm']=term\n",
    "        self.tweets[id]['count'] = self.tweets[id]['count'] +1\n",
    "        \n",
    "    def combineTweets(self,other):\n",
    "        for otherid in other.getIds():\n",
    "            tweet = other.getTweet(otherid)\n",
    "            searchTerm = other.getSearchTerm(otherid)\n",
    "            searchTime = other.getSearchTime(otherid)\n",
    "            self.addTweet(tweet,searchTime,searchTerm)\n",
    "        \n",
    "    def getTweet(self,id):\n",
    "        if id in self.tweets:\n",
    "            return self.tweets[id]['tweet']\n",
    "        else:\n",
    "            return None\n",
    "    \n",
    "    def getTweetCount(self,id):\n",
    "        return self.tweets[id]['count']\n",
    "    \n",
    "    def countTweets(self):\n",
    "        return len(self.tweets)\n",
    "    \n",
    "    # return a sorted list of tupes of the form (id,count), with the occurrence counts sorted in decreasing order\n",
    "    def mostFrequent(self):\n",
    "        ps = []\n",
    "        for t,entry in self.tweets.items():\n",
    "            count = entry['count']\n",
    "            ps.append((t,count))  \n",
    "        ps.sort(key=lambda x: x[1],reverse=True)\n",
    "        return ps\n",
    "    \n",
    "    # reeturns tweet IDs as a set\n",
    "    def getIds(self):\n",
    "        return set(self.tweets.keys())\n",
    "    \n",
    "    # save the tweets to a file\n",
    "    def saveTweets(self,filename):\n",
    "        json_data =jsonpickle.encode(self.tweets)\n",
    "        with open(filename,'w') as f:\n",
    "            json.dump(json_data,f)\n",
    "    \n",
    "    # read the tweets from a file \n",
    "    def readTweets(self,filename):\n",
    "        with open(filename,'r') as f:\n",
    "            json_data = json.load(f)\n",
    "            incontents = jsonpickle.decode(json_data)   \n",
    "            self.tweets=incontents\n",
    "        \n",
    "    def getSearchTerm(self,id):\n",
    "        return self.tweets[id]['searchTerm']\n",
    "    \n",
    "    def getSearchTime(self,id):\n",
    "        return self.tweets[id]['searchTime']\n",
    "    \n",
    "    def getText(self,id):\n",
    "        tweet = self.getTweet(id)\n",
    "        text=tweet['full_text']\n",
    "        if 'retweeted_status'in tweet:\n",
    "            original = tweet['retweeted_status']\n",
    "            text=original['full_text']\n",
    "        return text\n",
    "                \n",
    "    def addCode(self,id,code):\n",
    "        tweet=self.getTweet(id)\n",
    "        if 'codes' not in tweet:\n",
    "            tweet['codes']=set()\n",
    "        tweet['codes'].add(code)\n",
    "        \n",
    "   \n",
    "    def addCodes(self,id,codes):\n",
    "        for code in codes:\n",
    "            self.addCode(id,code)\n",
    "        \n",
    " \n",
    "    def getCodes(self,id):\n",
    "        tweet=self.getTweet(id)\n",
    "        if 'codes' in tweet:\n",
    "            return tweet['codes']\n",
    "        else:\n",
    "            return None\n",
    "    \n",
    "    # NEW -ROUTINE TO GET PROFILE\n",
    "    def getCodeProfile(self):\n",
    "        summary={}\n",
    "        for id in self.tweets.keys():\n",
    "            tweet=self.getTweet(id)\n",
    "            if 'codes' in tweet:\n",
    "                for code in tweet['codes']:\n",
    "                    if code not in summary:\n",
    "                            summary[code] =0\n",
    "                    summary[code]=summary[code]+1\n",
    "        sortedsummary = sorted(summary.items(),key=operator.itemgetter(0),reverse=True)\n",
    "        return sortedsummary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Put the values of your keys into these variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "consumer_key = 'YOUR-CONSUMER-KEY'\n",
    "consumer_secret = 'YOUR-CONSUMER-SECRET'\n",
    "access_token = 'YOUR-ACCESS-TOKEN'\n",
    "access_secret = 'YOUR-ACCESS-SECRET'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tweepy import OAuthHandler\n",
    "\n",
    "auth = OAuthHandler(consumer_key, consumer_secret)\n",
    "auth.set_access_token(access_token, access_secret)\n",
    "\n",
    "api = tweepy.API(auth)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also load some routines that we defined in [Part 3](SocialMedia - Part 3.ipynb):\n",
    "    \n",
    "1. Our routine for creating a customized NLP pipeline\n",
    "2. Our routine for including tokens\n",
    "3. The `filterTweetTokens` routine defined in an exercise (Without the inclusion of named entities. It will be easier to leave them out for now)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getTwitterNLP():\n",
    "    nlp = spacy.load('en')\n",
    "    \n",
    "    for word in nlp.Defaults.stop_words:\n",
    "        lex = nlp.vocab[word]\n",
    "        lex.is_stop = True\n",
    "    \n",
    "    special_case = [{ORTH: u'e-cigarette', LEMMA: u'e-cigarette', POS: u'NOUN'}]\n",
    "    nlp.tokenizer.add_special_case(u'e-cigarette', special_case)\n",
    "    nlp.tokenizer.add_special_case(u'E-cigarette', special_case)\n",
    "    vape_case = [{ORTH: u'vape',LEMMA:u'vape',POS: u'NOUN'}]\n",
    "    \n",
    "    vape_spellings =[u'vap',u'vape',u'vaping',u'vapor',u'Vap',u'Vape',u'Vapor',u'Vapour']\n",
    "    for v in vape_spellings:\n",
    "        nlp.tokenizer.add_special_case(v, vape_case)\n",
    "    def hashtag_pipe(doc):\n",
    "        merged_hashtag = True\n",
    "        while merged_hashtag == True:\n",
    "            merged_hashtag = False\n",
    "            for token_index,token in enumerate(doc):\n",
    "                if token.text == '#':\n",
    "                    try:\n",
    "                        nbor = token.nbor()\n",
    "                        start_index = token.idx\n",
    "                        end_index = start_index + len(token.nbor().text) + 1\n",
    "                        if doc.merge(start_index, end_index) is not None:\n",
    "                            merged_hashtag = True\n",
    "                            break\n",
    "                    except:\n",
    "                        pass\n",
    "        return doc\n",
    "    nlp.add_pipe(hashtag_pipe,first=True)\n",
    "    return nlp\n",
    "\n",
    "def includeToken(tok):\n",
    "    val =False\n",
    "    if tok.is_stop == False:\n",
    "        if tok.is_alpha == True: \n",
    "            if tok.text =='RT':\n",
    "                val = False\n",
    "            elif tok.pos_=='NOUN' or tok.pos_=='PROPN' or tok.pos_=='VERB':\n",
    "                val = True\n",
    "        elif tok.text[0]=='#' or tok.text[0]=='@':\n",
    "            val = True\n",
    "    if val== True:\n",
    "        stripped =tok.lemma_.lower().strip()\n",
    "        if len(stripped) ==0:\n",
    "            val = False\n",
    "        else:\n",
    "            val = stripped\n",
    "    return val\n",
    "\n",
    "def filterTweetTokens(tokens):\n",
    "    filtered=[]\n",
    "    for t in tokens:\n",
    "        inc = includeToken(t)\n",
    "        if inc != False:\n",
    "            filtered.append(inc)\n",
    "    return filtered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we will include some additional modules from Scikit-Learn:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.base import TransformerMixin\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS\n",
    "from sklearn.metrics import accuracy_score\n",
    "import string\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we're ready to go along for an exercise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identifying the source of social media comments might be an important step in the process of interpreting a large corpus. Continuing with our example of smoking and vaping, it might be interesting to compare tweets from users - people who are talking about their own personal use  to those who might be either promoting vaping  (manufacturers, sponsors, etc.) or warning about dangers of vaping (physicians, researchers, public health agencies, etc.).\n",
    "\n",
    "A team of researchers at RTI International tackled this problem in a 2018 paper [Classification of Twitter Users Who Tweet About E-Cigarettes](http://publichealth.jmir.org/2017/3/e63/) by Annice Kim and colleagues collected tweets and attributed them to individuals, enthusiasts, \"informed agencies (news media or health community), marketers, or spammers. \n",
    "\n",
    "Your goal here is to collect a small data set and to attempt a smaller version of this challenge. Specifically, we will try to collect preliminary data for a classifier capable of identifing tweets from users of e-cigarettes vs. others.  Using any of the code found in Parts 1-4, complete these steps:\n",
    "\n",
    "1. Run some searches for tweets like 'e-cig', 'e-cigarette', 'vape' and 'vaping'. Collect a corpus of 200-300  or more tweets. You might want to save each of these result sets in files.\n",
    "\n",
    "2. Combine these tweets into one large collection using the 'Tweet' class listed above. Save the results in a file \n",
    "\n",
    "3. Annotate 50 of these tweets as pertaining to either 'individual' or 'non-individual'. Be sure that you do at least a few of the tweets from each of the original sets. One way to do this might be to randomize the tweets. Save the annotated results in a file. \n",
    "\n",
    "4.Review at the distrbution. Is it close to even? If not, do more.\n",
    "\n",
    "5. Take your annotated tweets - split them into train (80%) and test (20%) sets.  Process the train data and build a model (based on a TfIdf Vectorizer and an SVM). Evaluate the model on the test data sets.\n",
    "\n",
    "6. Test your model on the remaining tweets. What does your result look like?\n",
    "\n",
    "7. Review some of the data to identify opportunities for improvement - how might you make these models bettter?\n",
    "\n",
    "8. Reflect on the reproducibility and the reusability of the code: what should be done to make these tools easier to apply to other datasets.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "*ANSWER FOLLOWS - insert answer here*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*END ANSWER*\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6",
   "language": "python",
   "name": "python3.6"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
